Learning
= General The typical supervised learning problem is: given pairs (X_1, Y_1) , ..., (X_n, Y_n) , where X are called ''input'', or ''explanatory'', variables, and Y are called ''output'', or ''dependent'', variables, and novel examples of input variables X^'_1 , ..., X^'_m , find \hat Y^'_1 , ..., \hat Y^'_m which resemble real Y^'_1 , ..., Y^'_m as close as possible. Each instance of (X_i, Y_i) is called an ''observation''. "As close as possible" often means minimizing RMSE (root mean square error), but other metrics are used as well. The set of pairs (X_i, Y_i) is called the ''training set''. The set of novel examples X^'_1 , ..., X^'_m for which Y's are to be determined is called the ''test set''. The error on the test set is called the ''generalization error''. The problem cannot be solved in general case. But for some classes of joint distribution \rho(X,Y) , optimal or suboptimal algorithms exist, and this turns to be useful for wide classes of datasets. In the simplest case, X_i is a vector of ''features'', all vectors have the same length. Each feature can be quantitative (i.e. car weight), qualitative (i.e. car type), or ordered categorical (i.e. "small", "medium", or "large"). The dependent variable(s) can be quantitative or qualitative as well. Sometimes similar problems, like estimating distribution of Y given novel X, need to be solved. Some learning algorithms depend on the parameters like gradient descent rate which cannot be determined on the training set directly. Such parameters are called ''metaparameters''. Abbreviatures * PRSS - Penalized Residual Sum of Squares. * RSS - Residual Sum of Squares. Bias and variance Suppose we have: * Joint distribution \rho(X,Y) * Some input point X_0 * Number of observations N * Some learning algorithm A Infinite number of times we perform the following experiment: * Create a training set T_i , randomly and independently selecting N observations from the joint distribution. (Here i is the number of the experiment.) * Randomly select Y_i(X_0) from the conditional distribution \rho^\prime(Y|X=X_0) = \rho(X_0,Y) . * Train the algorithm A on the training set T_i and obtain \hat Y_i(X_0) We want to find the error (Y_i-\hat Y_i)^2 , averaged over all '''i'''. In order to find it, we first describe Y_i as a sum of its mathematical expectation and noise: : Y_i = f(X_0) + \epsilon_i , where f(X) is the mathematical expectation of Y(X), while \epsilon_i is a noise. Similarly, we can describe \hat Y_i as : \hat Y_i = E(\hat Y(X_0)) + \Delta Y_i , where E(...) means mathematical expectation, that is averaging over all the experiments. By definition of mathematical expectation, E(\Delta Y)=0 Now we have: : E((\hat Y-Y)^2) = E(E(\hat Y(X_0))-f(X_0) - \Delta Y + \epsilon)^2) Since \epsilon is independent on other terms and its mathematical expectation is 0, this equals to: : E((E(\hat Y(X_0))-f(X_0) - \Delta Y)^2 + E(\epsilon^2) But E(\hat Y(X_0))-f(X_0) is independent on i. Let us call it B . Then the first term is : E((B - \Delta Y)^2) = B^2 - 2BE(\Delta Y) + E((\Delta Y)^2) Since E(\Delta Y)=0 , we finally obtain: : E((\hat Y-Y)^2) = B^2 + E((\Delta Y)^2) + E(\epsilon^2) Here B is called ''bias'', E((\Delta Y)^2) is ''variance'', and \epsilon is ''noise''. Generally, the more complicated the function \hat Y(X) (i. e. the more parameters it has), the lower bias and the higher variance. As for the noise, it is irreducible and independent on the algorithm or number of observations. Selecting metaparameters To select the number of parameters, two alternative criteria can be used * ''Akaike Information Criterion'' (AIC) - -2\ln L + k * ''Bayesian Information Criterion'' (BIC) - -2\ln L + k \ln n where L is likelihood, k is number of free parameters, n is number of observations in the training set. According to wikipedia's article [[wikipedia:Akaike information criterion|Akaike information criterion]], AIC seems to be better than BIC. * ''Vapnik-Cherniakovskis dimension'' is the maximal number of points which can be always separated by the model, no matter how they are categorized and what are their coordinates. For VC dimension h, and N observations, define \rho=h/N , then the effective error is \bar{\mathrm}\left(1-\sqrt{\rho-\rho\log\rho+{\log N\over N}}^{-1}\right)_+ . It seems to be worse than AIC or BIC. * Cross-validation. * Bootstrap: 0.368\bar\mathrm{err}_{ins}+0.368\bar\mathrm{err}_{outs} Function approximation A common approach is to approximate Y as : Y=f_\theta(X) where f_\theta is a parametrized family of functions. The learning is finding the most suitable \theta . If each \theta has the same prior probability, the maximal likelihood is for \theta minimizing : \mathrm{RSS}=\sum_i(f_\theta(X_i)-Y_i) Here RSS means Residual Sum of Squares. Often, however, prior probability is different for different \theta , therefore we use ''Penalized Residual Sum of Squares'' (''PRSS''). For example, since functions with high second derivative are improbable, the following PRSS is used for ''smoothing splines'': : \mathrm{PRSS}=\sum_i(f_\theta(X_i)-Y_i) + \lambda\int (f''(X))^2 dx Here \lambda is a metaparameter of learning. Linear model Let all features be numerical, and Y be a scalar. The linear model assumes linear dependence of Y on X: : \hat Y=\beta X + \beta_0 Adding the new feature X_0=1 , we have : \hat Y=\beta X Then : \textrm{RSS}(\beta) = \sum_i\left(y_i-\beta x_i\right)^2 In matrix form: : \mathrm{RSS}(\beta) = \left(\textbf{y}-\textbf{X}\beta\right)^T \left(\textbf{y}-\textbf{X}\beta\right) Minimizing it, we obtain the best fit: : \hat \beta = (\mathbf{X}^T X)^{-1} \mathbf{X}^T \mathbf{y} Here, X^T X is called the ''covariance matrix''. Distribution of the estimated coefficients If : y_i = \beta\mathbf{x}_i + \epsilon_i , where : \epsilon \sim N(0, \sigma) and noise for different observations is independent, we can substitute this into the best fit formula and obtian: : E(\hat beta)=\beta : \operatorname{Var}(\hat beta) = (X^T X)^{-1}\sigma^2 The typical estimation of \sigma^2 is: : \hat\sigma^2 = {1 \over N-p-1} \sum_i (y_i - \hat y_i)^2 Here p is the number of dimensions, and this formula has the attractive feature: : E(\hat\sigma^2) = \sigma^2 Since \beta is a linear function of (\epsilon_1,...,\epsilon_n) , which is normal centered at zero, the distribution of \hat\beta must be also normal. But we already know its mean and variance, therefore \hat\beta\sim N(\beta, (X^T X)^{-1}\sigma^2) Null hypothesis In order to test whether i-th coefficient is significantly non-zero, we should calculate : z_i = {\beta_i \over \sqrt{\left((X^T X)^{-1}\right)_{ii}}\sigma} This number is called ''Z-score''. If i-th coefficient is zero, z_i \sim N(0,1) . Gauss-Markov theorem The linear model has the least variance among all linear unbiased estimators. Indeed, the mean squared error is the sum of squared bias, variance, and mean squared noise. Considered the linear model and any other ubiased linear estimator, their bias is the same (zero), and their mean squared noise is also the same (it is the same for all models). Since another model has higher mean squared error, it also has higher variance. Regularization Dropping or penalizing the variables whose statistical significance is poor may improve the generalization error for two reasons: * The actual coefficient may be actually zero (or very small comparing to its estimation). * The actual coefficient may be far away from its estimation. There are two kinds of regularization: subset selection (every variable is either retained or discarded) and shrinkage (some variables are penalized, but not discarded). For all regularization methods there is a metaparameter (number of variables retained, the overall penalty multiplier etc.) It can be determined via training/validation division, cross-validation, AIC/BIC critheria. Subset selection Subset selection is also good, because it is useful for interpreting the results. * Best-subset selection: check all possible subsets, and choose the best one for each size. The size is our metaparameter. Disadvantage: need 2^p linear regressions (unless p>N, in which case we need "only" \binom{p}{n} linear regressions). Note however that the covariance matrix should be calculated only once, we only need to take its submatrices. * Forward stepwise: sequentially add the predictor which most improves the fit. Can use QR-decomposition for the current N×q submatrix of X (q is the current number of predictors) to rapidly find one. This may be better than the best-subset selection in terms of generalization error, and it is certainly faster. * Backward stepwise: start with the complete set of predictors, and sequentially drop the predictor with the smallest absolute Z-score. Advantage: easier to implement than the forawrd stepwise. Disadvantage: does not work for p>N. (For p=N, Z-scores are undefined, but we can try to delete each variable and check which of them retains the minimal residual error.) Another disadvantage: if p is big, and the number of variables to retain is small, can be more computationally intensive than the forward stepwise. * Hybrid approaches: at each iteration, do either forward or backward step depending on AIC, for example. * Forward stagewise: at each step, select the variable most correlated with the residuals, run the regression using this variable as the only input and the residuals as the output, and add the coefficient for this variable. Repeat until all correlations with residuals are (almost) zero. This method is slow (since it can examine the same variable many times), but sometimes may provide better generalization. * Incremental forward stagewise: as above, but increase the predictor coefficient by \pm epsilon only. Shrinkage methods Generally, they are smoother than subset selection methods and may lead to less variance. * Ridge regression: apply penalty \lambda\lVert\hat\beta\rVert^2 . This makes the formula \hat\beta=\left(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}\right)\mathbf{X}^T\mathbf{y} . This is the same as multiplying SVD components of X by d_i^2\over d_i^2+\lambda , where d_i is i-th diagonal element of the diagonal matrix obtained in SVD, or d_i^2 is N times the variance of the i-th component. This method does not delete any variables. * Lasso: apply penalty \lambda\sum_i|\hat\beta_i| . Some of the coefficient may become exactly zero, and for big enough \lambda all of them are. Therefore, this method deletes variables, but does it smoothly, since each coefficient is a continuous function of \lambda . ** Can be computed using LARS algorithm ** Or start with high \lambda , at each step decrease \lambda slightly and cycle through variables until convergence. * Combination of above: apply both penalties (with different \lambda ). Can also apply penalty on all non-zero terms (best-subset). * Least angle regression: similar to the forward stepwise. The coefficients are initially zero. Find the most correlated variable, add it to the active set, and increase the coefficient until it is as correlative with the residuals as some competitor. Increase the coefficients for both (in the direction of gradient descent of the RSS), until there is another variable as correlative with the residuals as both of them. Increase the coefficients of all the three etc. Intermediate * Principal Components Regression: retain only first several components, discard the others. * Partial least squares: at m-th iteration, do the following: : \mathbf{z}_m = \sum_i <\mathbf{x}_i^{m-1}, \mathbf{y}> \mathbf{x}_i^{m-1} : run the linear regression from \mathbf{z}_m to '''y''', add to the prediction : orthogonalize x's with respect to \mathbf{z}_m Basis functions f may be approximated as a linear combination of ''basis functions'' h_1(X) , ..., h_M(X) : : f_\theta(X) = \sum_i \theta_m h_m(X) Sometimes, the basis functions are known in advance and we only need to learn the coefficients \theta_1 , ..., \theta_M . In other cases, they are also parameter-dependent. Such methods are called ''dictionary methods'' which means that we have a (possibly infinite) "dictionary" of basis functions and search the functions we need. Examples: * Splines * Radial basis functions - multidimensional kernels located at centroids; Haussian kernels are popular * Single-layer feed-forward neural nets, h_m(X) = \sigma(a_m X+b_m) , \sigma(t)=1/(1+e^{-t}) Splines For order M splines, : h_m(x) = \begin{cases} x^{m-1}, & m\le M \\ (x-x_{m-M})_+^{M-1}, & M+1 \le m \le M+K \end{cases} where a_+ = \max(a,0) . The points x_1 , ..., x_K are externally given and called ''knots''. They have (M-2) continuous derivatives. Often, M=4 is used: : h_m(x)= \begin{cases} x^{m-1}, & m\le 4 \\ (x-x_{m-M})_+^3, & 5 \le M \le K+4 \end{cases} Such splines are called ''cubic splines''. Splines of order 2 are called ''linear splines''. Natural splines ''Natural splines'' are cubic splines which are linear outside [x_1, x_K] . To construct the component functions, notice that (x-x_k)_+^3 - (x-x_K)_+^3 is a square function at [x_K, \infty] , whose second derivative is 6(x_K-x_k) . Therefore, if : d_k(x) = \frac{(x-x_k)_+^3 - (x-x_K)_+^3}{x_K-x_k} , then d_k''=6 at [x_K, \infty] , and : d_{k+1}(x)-d_k(x) is linear at [x_K, \infty] Since d_K is undefined (0/0), we have K-2 functions d_2-d_1 , ... , d_{K-1}-d_{K-2} . We also have 2 functions f(x)=1 and f(x)=x. Totally, we have K functions instead of K+4 for cubic splines. The missing 4 functions correspond to 4 constraints: square and cubic terms must be 0 at both x and x>x_K . Smoothing splines Apply the penalty \lambda\int \left(f^{\prime\prime}(x)\right)^2dx , where \lambda is a metaparameter. Multidimensional splines The basis functions are h_{ij}(\mathbf{x}) = h_i^{(1)}(x_1) h_j^{(2)}(x_2) for 2-dimensional case, similar for higher dimensions. Wavelet functions These reward sparseness rather than smoothness. The basis is: : \psi_{m,n}(t)=2^{-m/2}\psi(2^{-m}t-nb) . Sometimes, another number instead of 2 can be used. Here \psi(x)=\phi(2x)-\phi(2x-1) , \phi is called a ''mother wavelet''. If the penalty is 2\lambda|\boldsymbol{\theta}| , then : \theta_i = \operatorname{sign} y^*_i\left(|y^*_i|-\lambda\right) where y^*_i is the i-th coefficient before penalty, \mathbf{y}^*=\mathbf{W}^\mathrm{T}\mathbf{y} . A simple choice is \lambda=\sigma\sqrt{2\log n} , where \sigma is the estimated standard deviation of the noise of y's. \mathbf{y}^* can be calculated in linear time using pyramide schemes. Radial basis functions and mixture model ''Radial basis functions'' (RBF) are: : h_i(\mathbf{x}) = D(||\mathbf{x}-\mathbf{c}_i||/r_i) If all r's are the same, this may cause holes. To avoid holes, one can use ''renormalized radial basis functions'': : h_i(\mathbf{x}) = {D(||\mathbf{x}-\mathbf{c}_i||/r)\over \sum_j D(||\mathbf{x}-\mathbf{c}_j||/r)} If, instead of D, we use an arbitrary Gaussian kernels (with arbitrary covariance matrix), this is called ''mixture model''. Local models kNN KNN (or kNN) means "K Nearest Neighbours". The prediction for Y is assumed to be the average of Y in several neighbour points: : \hat Y(x) = {1\over k} \sum_{i=1}^k Y(x_i(x)) where x_i is i-th closest point to x, k is a metaparameter. Curse of dimensionality Given n (number of observations), and increasing number of features (dimension), the distance even to the closed nearest neighbour soon becomes comparable to the distance to an average point, and then becomes increasingly close to it. Therefore, when there are enough features the closest neighbours are not much closer than other points, and the kNN model does not work. It fails at O(log {n \over k}) features. Another way to formulate the same problem: if we increase the dimensionality, and want the k nearest neighbours to be at the distance not more than some fixed r (comparing to the average distance to the points), the number of observations we need is increasing exponentially. Another problem with dimensionality: the higher the dimensionality, the closer an average point will be to boundaries of the region. Not only the region where the neighbours of X reside will be big, but also X will be out of it. Kernel methods Kernel methods minimize ''weighted'' RSS, which weights points close to the given point higher than the distant points: : To obtain \hat Y(X) , :: Minimize \sideset{}{_i}\sum K(X_i,X)(Y_i-f_\theta(X_i))^2 with respect to \theta :: Set \hat Y(X) = f_\theta(X) Here K is called the kernel. The Haussian kernel is popular. If f_\theta=\mathrm{const} , we obtain the Nadaraya-Watson weighted average: : \hat Y(X) = \frac{\sum_i K(X, X_i) Y_i}{\sum_i K(X, X_i)} If f_\theta is linear function, we have the ''local linear regression'' model. Kernel methods are also prone to the dimensionality curse. Kernels and regressions * ''Epanechnikov kernel'': {3 \over 4}(1-t^2) if |t|<1 , otherwise 0 * ''Tri-cube kernel'': (1-|t|^3)^3 if |t|<1 , otherwise 0 * ''Local polynomial regression'': fit polynomials * ''ANOVA'' (analysis of variance) - \hat Y = a + \sum_i f_i(x_i) + \sum_i\sum_{j * f(X) = \alpha(Z) + \beta(Z)^\mathrm{T}(X_1, ..., X_m) , where Z=(X_{m+1},...,X_p) Classification Output G belongs to one of K classes. We are to determine the most probable class. Discriminant analysis Let f_k(X) be the density of points belonging to a class k (''class-conditional density'') and \pi_k be the prior probability of the class k. Then the probability that some point X belongs to the class k is: : P_k(X) = {f_k(X)\pi_k\over \sum_i f_i(X)\pi_i} Consider the distribution to be Gaussian: : f_k(X) = {1 \over (2\pi)^{p/2}|\Sigma_k|^{1/2}} exp\left(-{1\over 2}(X-\mu_k)^T\Sigma_k(X-\mu_k)\right) If \Sigma is the same for all groups, the model is called '''linear discriminant analysis''' (LDA), otherwise it is called '''quadratic discriminant analysis''' (QDA). The borders between classes are linear for the former and quatratic for the latter, thats why the name. Regularized discriminant analysis The formula : \hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1-\alpha)\hat\Sigma (the first term from QDA, the second term from LDA) is called '''regularized discriminant analysis'''. Here \alpha is a metaparameter. Space transformations For LDA, we can linearly transform the space so that \Sigma will be unitary. Then we can project all points to subspace spanned by centroids. We can further decompose the subspace into successfully optimal subspaces in terms of centroids separation, by minimizing a^T W a \over a^T B a w. r. to a, where W is within-class variance, B is between-class variance. Utilizing points with unknown class Note that class-conditional densities should satisfy the constraint P(X)=\sum_k \pi_k f_k(X) With this constraint, even points with unknown class can be used to improve prediction for these models. Logistic regression : P_k(X) = {exp(\beta_k X) \over \sum_l exp(\beta_l X)} , and \beta_K=0 Find betas with Newton-Raphson algorithm. We can drop the least significant variables, or, better, those which contribute to the log-likelihood minimally. We also can penalize betas in a similar way to the lasso model, subtracting \lambda\sideset{}{_i}\sum|\beta_i| from the log-likelihood. Separating plane We have a set (\mathbf{x}_1, y_1) , ..., (\mathbf{x}_n, y_n) , y_i\in {-1,1} . We are to find the weights \beta such that \beta x_i > 0 for all i. Rosenblatt's Perceptron Learning Algorithm For each '''misclassified''' observation, weights are updated as : \beta \leftarrow \beta + \rho x_i y_i It can be proved that the formula converges to a solution, if one exists. When the weights are updated after each observation, it is called ''stochastic gradient descent''. Optimized Separating Hyperplanes We are to maximize the distance to the closed point from either class. : \max_{\beta,M} M \;\; \mathrm{ s. t. }\;\; ||\beta||^2=1,\;\; y_i\mathbf{x_i}^\mathrm{T}\beta \ge M,\;\;i=1,...,n We can get rid of the ||\beta||^2=1 constraint: : \max_{\beta,M} M\;\; \mathrm{ s. t. }\;\; y_i\mathbf{x_i}^\mathrm{T}\beta \ge M||\beta||,\;\;i=1,...,n Indeed, since ||\beta||=1 , we can replace M by M||\beta|| . After this, since multiplying \beta by any positive constant does not affect the inequality, we can drop the constraint ||\beta||=1. Since the inequality is scalable now, we can set ||\beta||=\frac{1}{M} : : \min_\beta{1\over 2}||\beta||^2\;\; \mathrm{ s. t. }\;\; y_i\mathbf{x_i}^\mathrm{T}\beta \ge 1 Solving this with Lagrange multipliers, we obtain the dual form (''Wolfe dual''): : \max_\alpha L_D = \sum_i \alpha_i - {1\over 2}\sum_{ij}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^\mathrm{T} \mathbf{x}_j :subject to \alpha_i\ge 0 for all i This is a convex problem which can be solved. Points for which \alpha>0 are called ''support points''. = Comments